Week 2
Old Dominion University
What would be a good forecast, and why?
What is the “goal” or “objective”?
Does visualizing the data change your mind?
ggplot(data.frame(air), aes(x = air)) +
  geom_density(fill = "lightblue", color = "lightblue") +
  labs(x = "Count of Passengers", y = "Density", title = "") +
  theme_minimal() + # Simpler background theme with x and y axis
  theme(
    plot.title = element_text(hjust = 0.5), # Centering title
    axis.text.x = element_text(angle = 0, hjust = 1, color = "black"),
    axis.text.y = element_text(color = "black")
  )
Point Forecast: a best guess (\(\hat{y}\)) for an unknown value.
Forecast Distribution: a distribution, or range, of possibilities and their corresponding likelihoods/probabilities.
Point Forecast (again): a summary statistic of a forecast distribution
\[ e = y - \hat{y} \]
Will there always be forecast error?
So long as \(y\) is not deterministic, yes!
Of course, some errors are worse than others.
\[ L(e): \text{Loss Function} \]
What are some examples of Loss Functions?
Absolute Loss
\[L(e) = |e| \]
Quadratic Loss
\[L(e) = e^2 \]
What are some features (good / bad) about these Loss Functions?
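A quick way to compare them is to evaluate both on the same errors (the error values here are purely illustrative): absolute loss grows linearly with the size of the miss, while quadratic loss penalizes large misses disproportionately.

```r
# Compare absolute and quadratic loss on a few hypothetical forecast errors
e <- c(-10, -1, 0, 1, 10)
abs_loss  <- abs(e) # grows linearly with the miss
quad_loss <- e^2    # penalizes large misses much more heavily
data.frame(e, abs_loss, quad_loss)
```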
Forecasts seek to minimize risk, which is defined as the expected loss:
\[R(\hat{y}) = E[L(e)] = E[L(y - \hat{y})]\]
For quadratic loss:
\[ E[(y - \hat{y})^2] = E[y^2] - 2\hat{y}E[y] + \hat{y}^2\]
Remember, we know \(E[y]\). From here, we can minimize this function with respect to \(\hat{y}\) by setting its derivative to zero:
\[\frac{\partial}{\partial \hat{y}}\left(E[y^2] - 2\hat{y}E[y] + \hat{y}^2\right) = -2E[y] + 2\hat{y} = 0\]
Solving gives \(\hat{y} = E[y]\). So, the \(\hat{y}\) that minimizes our risk is the mean of the distribution of \(y\).
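The same result can be checked numerically: over a simulated sample, the \(\hat{y}\) that minimizes the average squared error coincides with the sample mean (the simulation parameters below are illustrative).

```r
# Check numerically that the mean minimizes average quadratic loss
set.seed(1)
y <- rnorm(10000, mean = 5, sd = 2)
risk <- function(yhat) mean((y - yhat)^2) # empirical risk under quadratic loss
opt <- optimize(risk, interval = range(y))
c(optimal_forecast = opt$minimum, sample_mean = mean(y))
```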
So, if \(\hat{y}^* = E[y]\), how do we calculate \(E[y]\)?
Continuous: \(E[y] = \int y \, f(y) \, dy\)
Discrete: \(E[y] = \sum_i y_i \, P(y = y_i)\)
Empirical: \(\bar{y} = \frac{1}{T}\sum_{t=1}^{T} y_t\)
Suppose you flip a weighted coin with the following distribution:
\[P(y = T) = \frac{3}{7}\] \[P(y = H) = \frac{4}{7}\]
If the coin comes up tails, you are given $40. If the coin comes up heads, you owe $25.
What is the expected value of the coin flip? What is the expected “risk” (i.e. \(E[L(e)]\))?
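A worked answer, treating the payoff as the random variable \(y\) (tails pays +$40, heads costs $25):

\[E[y] = 40 \cdot \tfrac{3}{7} + (-25) \cdot \tfrac{4}{7} = \tfrac{120 - 100}{7} = \tfrac{20}{7} \approx 2.86\]

Under quadratic loss, with the optimal forecast \(\hat{y} = E[y]\), the risk is just the variance:

\[E[L(e)] = E[(y - E[y])^2] = E[y^2] - E[y]^2 = \tfrac{7300}{7} - \tfrac{400}{49} = \tfrac{50700}{49} \approx 1034.7\]

where \(E[y^2] = 1600 \cdot \tfrac{3}{7} + 625 \cdot \tfrac{4}{7} = \tfrac{7300}{7}\).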
Let’s use our shiny new optimal forecast.
# Convert the time series to a data frame
air_df <- data.frame(
  date = seq(as.Date("1949-01-01"), as.Date("1960-12-01"), by = "months"),
  count = as.numeric(air)
)
# Create a tibble for the forecast period
air2_df <- tibble(
  date = seq(as.Date("1961-01-01"), as.Date("1962-12-01"), by = "months"),
  count = mean(air)
)
# Plot using ggplot2
ggplot(air_df, aes(x = date, y = count)) +
  geom_line() +
  geom_line(data = air2_df, aes(x = date, y = count), col = "tomato", lwd = 1, linetype = "dashed") +
  xlim(as.Date(c("1949-01-01", "1963-01-01"))) +
  labs(x = "Date", y = "Count of Passengers") +
  theme_minimal()
This looks terrible, but let's continue for now.
If our variable of interest is normally distributed, we can calculate a confidence interval for our predicted outcome.
\[ [\mu - \sigma Z_{\alpha/2}, \mu + \sigma Z_{\alpha/2}] \]
\[ [\hat{y} - \sigma Z_{\alpha/2}, \hat{y} + \sigma Z_{\alpha/2}] \]
\[ [280.3 - 120 * 1.96, 280.3 + 120 * 1.96] \]
\[ [45.1, 515.5] \]
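These numbers can be reproduced directly, assuming `air` is the built-in `AirPassengers` series (the slide rounds \(\sigma\) to 120, so the endpoints differ slightly in the last digit).

```r
# 95% interval for the constant forecast, using the sample mean and sd
air <- AirPassengers
c(lower = mean(air) - 1.96 * sd(air),
  mean  = mean(air),
  upper = mean(air) + 1.96 * sd(air))
```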
# Create a sequence of dates for 1961-01 to 1962-12
dates <- seq(as.Date("1961-01-01"), by = "month", length.out = 24)
# Calculate the mean value and the standard deviation for the entire 'air' dataset
air_mean <- mean(air)
air_sd <- sd(air)
# Construct the dataframe
air2_df <- tibble(
  date = dates,
  mean = rep(air_mean, times = 24), # The mean is a constant for every row
  upper = rep(air_mean + (1.96 * air_sd), times = 24),
  lower = rep(air_mean - (1.96 * air_sd), times = 24)
)
# Plot using ggplot2
ggplot(air_df, aes(x = date, y = count)) +
  geom_line() +
  geom_line(data = air2_df, aes(y = mean), col = "tomato", lwd = 1, linetype = "dashed") +
  geom_line(data = air2_df, aes(y = upper), col = "tomato", lwd = 1) +
  geom_line(data = air2_df, aes(y = lower), col = "tomato", lwd = 1) +
  labs(x = "Time", y = "Frequency") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5), # Centering title
    axis.text.x = element_text(angle = 0, hjust = 1, color = "black"),
    axis.text.y = element_text(color = "black")
  )
# Create a sequence of dates
timez <- seq(as_date("1949-01-01"), as_date("1962-12-01"), by = "month")
# Create the air3 tibble
air3 <- tibble(
  time = timez,
  air = c(as.vector(air), rep(NA, 24))
)
# Fit a constant model
reg <- lm(air ~ 1, data = air3)
# Add the predictions and confidence intervals to air3
air3 <- air3 %>%
  mutate(
    pred = predict(reg, newdata = air3),
    pred_l = pred - (1.96 * sd(residuals(reg))),
    pred_u = pred + (1.96 * sd(residuals(reg)))
  )
# View the first few rows of the tibble
head(air3, 3)
# A tibble: 3 × 5
time air pred pred_l pred_u
<date> <dbl> <dbl> <dbl> <dbl>
1 1949-01-01 112 280. 45.2 515.
2 1949-02-01 118 280. 45.2 515.
3 1949-03-01 132 280. 45.2 515.
## make the plot
ggplot(air3, aes(x = time, y = air)) +
  geom_line() +
  geom_line(data = filter(air3, time > ymd("1960-12-01")),
            aes(y = pred), col = "tomato", linetype = "dashed") +
  geom_line(data = filter(air3, time > ymd("1960-12-01")),
            aes(y = pred_l), col = "tomato") +
  geom_line(data = filter(air3, time > ymd("1960-12-01")),
            aes(y = pred_u), col = "tomato") +
  coord_cartesian(xlim = as.Date(c("1949-01-01", "1963-01-01")),
                  ylim = c(min(air3$pred_l, na.rm = TRUE), max(air3$air, na.rm = TRUE))) +
  labs(x = "", y = "Frequency") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5), # Centering title
    axis.text.x = element_text(angle = 0, hjust = 1, color = "black"),
    axis.text.y = element_text(color = "black")
  )
Everything we've talked about thus far is as true for cross-sectional data as it is for time series data.
Let’s talk about time now.
Time, or rather periodicity, can come in many forms: hourly, daily, weekly, monthly, quarterly, or annual observations.
This is called the frequency of the series.
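In R, a `ts` object carries this as its `frequency` (observations per seasonal cycle); for example, with the built-in monthly `AirPassengers` series:

```r
# Observations per cycle for a ts object: 12 = monthly, 4 = quarterly
frequency(AirPassengers)                  # monthly airline data -> 12
frequency(ts(rnorm(20), frequency = 4))   # a quarterly series -> 4
```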
A common feature of time series data is intertemporal dependency. For this, we have lags and leads.
Lags: \(y_{t-1}\), \(y_{t-2}\), \(y_{t-k}\)
Leads: \(y_{t+1}\), \(y_{t+2}\), \(y_{t+k}\)
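A minimal sketch of lags and leads using `dplyr::lag()` and `dplyr::lead()` (the toy series below is hypothetical):

```r
library(dplyr)

y <- c(5, 7, 6, 9) # hypothetical series y_1, ..., y_4
tibble(
  t     = 1:4,
  y     = y,
  lag1  = lag(y),  # y_{t-1}; NA in the first period
  lead1 = lead(y)  # y_{t+1}; NA in the last period
)
```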
Periods:
Sample: \(\{y_{1}, y_{2}, y_{3}, ..., y_{T}\}\)
Out of Sample: \(\{y_{T+1}, y_{T+2}, ..., y_{T+h}\}\) where \(h\) is the horizon
Notation:
\(\hat{y}\) : \(y\) \(\implies\) \(\hat{y}_t\) : \(y_t\)
\(\hat{y}_t\) : \(y_t\) \(\implies\) \(\hat{y}_{T+h}\) : \(y_{T+h}\)
However, these are all ambiguous: they do not say what information the forecast is conditioned on.
\(\hat{y}_{T+h|t}\), \(y_{T+h|t}\)
Conditioning on relevant variables will improve the forecast (i.e., reduce risk).
Both point forecasts and forecast intervals are functions of conditioning variables.
Let’s call all available information \(\Omega_t\).
library(gridExtra)
b <- read.csv("../data/bball_allstars.csv")
# Plot for all basketball players
p1 <- ggplot(b, aes(x = HEIGHT)) +
  geom_density(fill = "dodgerblue", alpha = 0.6) +
  geom_vline(aes(xintercept = mean(HEIGHT)), color = "black") +
  coord_cartesian(xlim = c(66, 85), ylim = c(0, .12)) +
  labs(title = "Height of Pro Basketball Players",
       x = paste("Mean:", round(mean(b$HEIGHT), 1)),
       y = "Density") +
  theme_minimal()
# Plot separating by League
p2 <- ggplot(b, aes(x = HEIGHT, fill = LEAGUE)) +
  geom_density(alpha = 0.5) +
  geom_vline(aes(xintercept = mean(HEIGHT[LEAGUE == "WNBA"])), color = "tomato", linetype = "dashed") +
  geom_vline(aes(xintercept = mean(HEIGHT[LEAGUE == "NBA"])), color = "dodgerblue", linetype = "dashed") +
  geom_vline(aes(xintercept = mean(HEIGHT)), color = "black") +
  coord_cartesian(xlim = c(66, 85), ylim = c(0, .12)) +
  labs(title = "Height of Pro Basketball Players",
       x = paste("Mean WNBA:", round(mean(b$HEIGHT[b$LEAGUE == "WNBA"]), 1), "| Mean NBA:", round(mean(b$HEIGHT[b$LEAGUE == "NBA"]), 1)),
       y = "Density") +
  theme_minimal() +
  scale_fill_manual(values = c("WNBA" = "tomato", "NBA" = "dodgerblue")) +
  theme(legend.position = "top") # "topright" is not a valid ggplot2 legend position
# Print plots
grid.arrange(p1, p2, ncol = 2)
# Add a 'letter' column splitting players by first initial
b <- b %>%
  mutate(letter = ifelse(toupper(substr(PLAYER, 1, 1)) %in% LETTERS[1:10], "A-J", "K-Z")) # compare uppercase initials so capitalized names are classified correctly
# Combine data for plotting
p1 <- ggplot(b, aes(x = HEIGHT)) +
  geom_density(fill = "dodgerblue", alpha = 0.6) +
  geom_vline(aes(xintercept = mean(HEIGHT)), linetype = 2, color = "black") +
  labs(
    title = "Height of Pro Basketball Players",
    x = paste("Mean:", round(mean(b$HEIGHT), 1)),
    y = "Density"
  ) +
  coord_cartesian(xlim = c(65, 90), ylim = c(0, 0.12)) +
  theme_minimal()
p2 <- ggplot(b, aes(x = HEIGHT, fill = letter)) +
  geom_density(alpha = 0.6) + # transparency so the overlapping densities stay visible
  geom_vline(aes(xintercept = mean(HEIGHT)), linetype = 2, color = "black") +
  geom_vline(data = filter(b, letter == "A-J"), aes(xintercept = mean(HEIGHT)), color = "dodgerblue", linetype = 2) +
  geom_vline(data = filter(b, letter == "K-Z"), aes(xintercept = mean(HEIGHT)), color = "tomato", linetype = 2) +
  labs(
    title = "Height of Pro Basketball Players",
    x = paste("Mean A-J:", round(mean(b$HEIGHT[b$letter == "A-J"]), 1), "| Mean K-Z:", round(mean(b$HEIGHT[b$letter == "K-Z"]), 1)),
    y = "Density"
  ) +
  scale_fill_manual(values = c("dodgerblue", "tomato")) +
  coord_cartesian(xlim = c(65, 90), ylim = c(0, 0.12)) +
  theme_minimal() +
  theme(legend.position = "top")
# Print plots
grid.arrange(p1, p2, ncol = 2)
Just like in cross-sectional settings, we may (or may not) observe some additional variables. However, we still need to model the Data Generating Process (DGP).
Data are not a choice, per se, but you do choose functional forms and specifications.
Once we select a model, we need to estimate the model’s parameters with data.
\[y_t = f(y_{t-1}, x_{t}, x_{t-1}, \tau_t, C_t, S_t)\]
\[y_t = \tau_t + C_t + S_t\]
Trend
Cycle
Season
It is useful to consider these things separately (additively)
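One standard way to look at the three pieces separately is `stats::decompose()`, which estimates an additive trend (by moving average), a seasonal component, and a remainder:

```r
# Additive decomposition of the monthly AirPassengers series
dec <- decompose(AirPassengers, type = "additive")
names(dec)           # includes "seasonal", "trend", "random"
round(dec$figure, 1) # the estimated seasonal effect for each month
```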
air_df <- tibble(date = seq(as.Date("1949-01-01"), as.Date("1960-12-01"), by = "months"),
                 count = as.numeric(air))
# Plot using ggplot2
ggplot(air_df, aes(x = date, y = count)) +
  geom_line() +
  labs(x = "Month", y = "Count of Passengers") +
  theme_minimal() +
  theme(legend.title = element_blank(),
        legend.position = "top", # "topright" is not a valid legend position
        legend.text = element_text(size = 12))
The simplest model has an intercept, but no trend, cycle, or season.
\[E[y_{t+h}|\Omega_t] = \beta_0\]
This is simple, but what type of setting would it be appropriate for?
ggplot(consume, aes(x = DATE, y = PCEC96_PC1)) +
  geom_line(lwd = 1) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_hline(yintercept = mean(consume$PCEC96_PC1), color = "dodgerblue", lwd = 1) +
  labs(
    x = "Month",
    y = "Percent Change (Year over Year)",
    title = "Real Personal Consumption Expenditures (PCEC96)"
  ) +
  theme_minimal()
consume_filtered <- consume %>%
  filter(DATE < ymd("2020-01-01"))
ggplot(consume_filtered, aes(x = DATE, y = PCEC96_PC1)) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_hline(yintercept = mean(consume_filtered$PCEC96_PC1), color = "dodgerblue", lwd = 1) +
  labs(
    x = "Month",
    y = "Percent Change (Year over Year)",
    title = "Real Personal Consumption Expenditures (PCEC96)"
  ) +
  theme_minimal() +
  coord_cartesian(xlim = c(min(consume$DATE), max(consume$DATE))) # setting xlim to the range of the original data
consume_filtered <- consume %>%
  filter(DATE < ymd("2020-01-01"))
mean_m <- mean(consume_filtered$PCEC96_PC1)
sd_m <- sd(consume_filtered$PCEC96_PC1)
ggplot(consume, aes(x = DATE, y = PCEC96_PC1)) +
  geom_line(data = consume_filtered) +
  geom_hline(yintercept = 0, linetype = 2) +
  # Point forecast line
  geom_segment(aes(x = ymd("2020-01-01"), xend = max(consume$DATE),
                   y = mean_m, yend = mean_m, linetype = "Point Forecast"),
               color = "mediumseagreen", lwd = 1) +
  # Upper CI line
  geom_segment(aes(x = ymd("2020-01-01"), xend = max(consume$DATE),
                   y = mean_m + 1.645 * sd_m, yend = mean_m + 1.645 * sd_m, linetype = "90% CI"),
               color = "mediumseagreen", lwd = 1) +
  # Lower CI line
  geom_segment(aes(x = ymd("2020-01-01"), xend = max(consume$DATE),
                   y = mean_m - 1.645 * sd_m, yend = mean_m - 1.645 * sd_m, linetype = "90% CI"),
               color = "mediumseagreen", lwd = 1) +
  labs(
    x = "Month",
    y = "Percent Change (Year over Year)",
    title = "Real Personal Consumption Expenditures (PCEC96)"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  coord_cartesian(xlim = c(min(consume$DATE), max(consume$DATE))) +
  # Add the legend
  scale_linetype_manual(
    name = "",
    values = c("Point Forecast" = 1, "90% CI" = 3)
  )
Forecast Errors
\(e_t = y_{t+h} - E[y_{t+h}|\Omega_t] \implies y_{t+h} = E[y_{t+h}|\Omega_t] + e_t\)
Residuals
\(\hat{e}_t = y_{t+h} - \hat{y}_{t+h} = y_{t+h} - \hat{\beta}_0\)
It may be helpful to plot the residuals over time to see if there are any remaining patterns.
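As a sketch of that check, refitting the intercept-only model to the `AirPassengers` series so the chunk is self-contained:

```r
# Residuals from an intercept-only fit to the AirPassengers series
air <- AirPassengers
reg <- lm(as.numeric(air) ~ 1)
plot(residuals(reg), type = "l",
     xlab = "Observation", ylab = "Residual",
     main = "Residuals over time")
abline(h = 0, lty = 2)
# The trend and seasonality still visible in the residuals signal
# that the constant model is missing structure.
```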
Forecast Variance
Trend Models
ECON 707/807: Econometrics II